The goal of this project is to pull together a database of the various NYC transportation options to ease and expedite future analyses. Currently, the database is approximately 20gb representing 140mm rows, and can be created using the Create_database.R script.
Create_database.R creates the SQLite database of the Citi Bike, Subway, and (eventually) the Taxi data. Shell scripts in each folder must be run first to download the data. Individual files to clean and analyze the data from the database are in the folders: Citi-bike, Subway-turnstiles, Taxi:
Citi-bike.RSubway_turnstiles.RTaxi.ROnce the database is created, data can easily be accessed via SQL and dbplyr queries:
# establish the connection to the database
conn <- dbConnect(RSQLite::SQLite(), "NYC.db")
# query and mutate on-disk
turnstile.df <- tbl(conn, "turnstile.2019")
turnstile.df %>%
select(Station, Time, Entries, Exits) %>%
group_by(Station) %>%
summarize(Entries = sum(Entries),
Exits = sum(Exits))
# or pull data into memory and then treat as a standard data frame
turnstile.df <- tbl(conn, "turnstile.2019") %>% collect()